Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(jsonish): Add support for curly quotes in JSON parsing #1249

Open
wants to merge 4 commits into
base: canary
Choose a base branch
from

Conversation

revidious
Copy link
Contributor

@revidious revidious commented Dec 16, 2024

Description

LLMs will sometimes output JSON with curly quotes (U+201C and U+201D) instead of straight quotes (U+0022). This PR adds automatic normalization of these quotes during parsing, making BAML more robust when handling LLM outputs.

Changes

  • Added normalize_quotes() function that converts:
    • Left double quotation mark (U+201C) → Basic quotation mark (U+0022)
    • Right double quotation mark (U+201D) → Basic quotation mark (U+0022)
  • Applied normalization before all parsing attempts:
    • Standard JSON parsing (serde_json)
    • Markdown-embedded JSON
    • Multiple JSON objects
    • JSON fixing parser
  • Preserved original string for error messages and output

Fixes #1074


Important

Add support for curly quotes in JSON parsing and implement custom SIGINT handling for graceful shutdowns.

  • JSON Parsing:
    • Added normalize_quotes() in entry.rs to convert curly quotes (U+201C, U+201D) to standard quotes (U+0022).
    • Applied normalization in parse() before parsing JSON, Markdown-embedded JSON, multiple JSON objects, and JSON fixing parser.
  • Signal Handling:
    • Implemented custom SIGINT handling in lib.rs using ctrlc crate to ensure graceful shutdown across Python/Rust boundary.
    • Added dependencies libc, ctrlc, and tokio-util in Cargo.toml for signal handling support.

This description was created by Ellipsis for 0513ae7. It will automatically update as commits are pushed.

This implements a custom signal handling mechanism using the ctrlc crate to: 1. Bypass Python's signal handling 2. Provide consistent behavior across platforms 3. Ensure graceful shutdown with proper exit codes

While this is a workaround, it provides reliable interrupt handling without requiring major architectural changes to BAML's signal handling.
LLMs sometimes output JSON with curly quotes (U+201C/U+201D) instead of straight quotes (U+0022). This change adds automatic normalization of these quotes during parsing, making the JSON parser more robust to LLM output variations.

Changes:\n- Add normalize_quotes() function to convert curly quotes to straight quotes\n- Apply normalization before all parsing attempts (serde, markdown, multi-json)\n- Preserve original string for error messages and output\n\nFixes BoundaryML#1074
Copy link

vercel bot commented Dec 16, 2024

@revidious is attempting to deploy a commit to the Gloo Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 0513ae7 in 1 minute and 6 seconds

More details
  • Looked at 151 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. engine/baml-lib/jsonish/src/jsonish/parser/entry.rs:41
  • Draft comment:
    Consider normalizing quotes once and reusing the result to avoid redundant operations.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The normalize_quotes function is correctly implemented to replace curly quotes with straight quotes. However, the function is called multiple times in the parse function, which could lead to unnecessary performance overhead. It would be more efficient to call normalize_quotes once and reuse the result.
2. engine/language_client_python/src/lib.rs:32
  • Draft comment:
    Consider documenting any potential limitations or side effects of using the ctrlc crate for SIGINT handling.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The ctrlc crate is used to handle SIGINT signals, which is a good approach for ensuring graceful shutdowns. However, the comment mentions that this is a hacky solution. It would be beneficial to document any potential limitations or side effects of this approach in the code comments.

Workflow ID: wflow_2kgUvhKu6kb7Dltf


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@imalsogreg
Copy link
Contributor

@revidious Thanks for this PR, it looks great!

Regarding curly quotes, do you think it's better to do normalization as a preprocessing step, as you have done in the PR? Or should we add curly quotes as another type of quotation mark, alongside the parsers for quoted string, single-quoted string, and backtick-string, here: https://github.com/BoundaryML/baml/blob/canary/engine/baml-lib/jsonish/src/jsonish/parser/fixing_parser/json_parse_state.rs#L540 ?

Copy link
Contributor

@hellovai hellovai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@revidious regarding gregs comment, I would highly recommend doing it with that approach rather than a preprocessing step. Additoinally use the AnyOf capability in the preprocessor as well, since it could very intentionally be an actual curly quote as intended by the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parser Improvement: Support ” along with "
4 participants